# High-resolution image processing
## Kimi-VL-A3B-Thinking-2506
moonshotai · MIT · Image-to-Text · Transformers · 515 downloads · 67 likes

Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking, with significant improvements in multimodal reasoning, visual perception and understanding, and video scene processing. It supports higher-resolution images and reasons more effectively while consuming fewer tokens.
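Kimi-VL ships its own modeling code, so loading it through Transformers needs `trust_remote_code=True`. A minimal sketch assuming the hub id `moonshotai/Kimi-VL-A3B-Thinking-2506` and a chat-template interface like the one on the model card; exact message fields may differ:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking-2506"
# The checkpoint ships custom modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("demo.png")  # placeholder input image
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "demo.png"},
        {"type": "text", "text": "Describe this image step by step."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```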
## style_250412.vit_base_patch16_siglip_384.v2_webli
p1atdev · Image Classification · Transformers · 66 downloads · 0 likes

A vision model based on the Vision Transformer architecture and trained with SigLIP (Sigmoid Loss for Language-Image Pretraining), suitable for image understanding tasks.
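If the checkpoint is stored in timm format, it can be loaded straight from the Hub with the `hf_hub:` prefix. A classification sketch; the repo id `p1atdev/style_250412.vit_base_patch16_siglip_384.v2_webli` is inferred from the listing and may differ:

```python
import timm
import torch
from PIL import Image

# Repo id inferred from the listing; adjust if it differs.
model = timm.create_model(
    "hf_hub:p1atdev/style_250412.vit_base_patch16_siglip_384.v2_webli",
    pretrained=True,
)
model.eval()

# Build the preprocessing pipeline the checkpoint expects (384x384 here).
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("artwork.png").convert("RGB")
with torch.no_grad():
    logits = model(transform(image).unsqueeze(0))
probs = logits.softmax(dim=-1)
print(probs.topk(3))  # top predicted classes and their probabilities
```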
## eva02_large_patch14_clip_224.merged2b
timm · MIT · Image Classification · 165 downloads · 0 likes

EVA-CLIP is a vision-language model distributed as OpenCLIP- and timm-compatible weights, supporting tasks such as zero-shot image classification.
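Since the entry advertises zero-shot classification, OpenCLIP is the natural interface. A sketch assuming the full timm repo id carries the usual training suffix (`merged2b_s4b_b131k`), which the listing abbreviates:

```python
import torch
import open_clip
from PIL import Image

# Full repo id assumed; the listing abbreviates it to "merged2b".
repo = "hf-hub:timm/eva02_large_patch14_clip_224.merged2b_s4b_b131k"
model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog", "a satellite image"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score each caption against the image.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```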
## vit_huge_patch14_clip_378.dfn5b
timm · Other · Image Classification · Transformers · 461 downloads · 0 likes

The visual encoder component of DFN5B-CLIP, based on the ViT-Huge architecture and trained on 378x378-pixel images for use in a CLIP model.
## vit_so400m_patch14_siglip_gap_896.pali2_10b_pt
timm · Apache-2.0 · Image Feature Extraction · Transformers · 57 downloads · 1 like

A vision model based on the SigLIP image encoder with global average pooling, part of the PaliGemma2 model.
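As a pure encoder, this checkpoint is mainly useful for pooled image embeddings. A standard timm feature-extraction sketch; the timm model name is inferred from the listing:

```python
import timm
import torch
from PIL import Image

# num_classes=0 removes any head, returning the global-average-pooled embedding.
model = timm.create_model(
    "vit_so400m_patch14_siglip_gap_896.pali2_10b_pt",
    pretrained=True,
    num_classes=0,
)
model.eval()

config = timm.data.resolve_model_data_config(model)  # expects 896x896 inputs
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("scene.jpg").convert("RGB")
with torch.no_grad():
    embedding = model(transform(image).unsqueeze(0))
print(embedding.shape)  # (1, embed_dim)
```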
## clip-finetuned-csu-p14-336-e3l57-l
kevinoli · Zero-Shot Image Classification · Transformers · 31 downloads · 0 likes

A fine-tuned version of openai/clip-vit-large-patch14-336, used primarily for image-text matching tasks.
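Image-text matching with a CLIP checkpoint follows the standard Transformers pattern. The sketch below uses the base `openai/clip-vit-large-patch14-336`; the fine-tuned repo id (presumably `kevinoli/clip-finetuned-csu-p14-336-e3l57-l`) would be swapped in the same way:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Base checkpoint shown; substitute the fine-tuned repo id here.
model_id = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")
captions = ["a dog playing in the park", "a city skyline at night"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
# logits_per_image: similarity of the image to each candidate caption.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```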
## idefics2-8b-chatty
HuggingFaceM4 · Apache-2.0 · Image-to-Text · Transformers · English · 617 downloads · 94 likes

Idefics2 is an open multimodal model that accepts arbitrary sequences of images and text as input and generates text output. It can answer questions about images, describe visual content, create stories grounded in multiple images, or function purely as a language model.
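Idefics2 is natively supported in Transformers (v4.40+) through `AutoModelForVision2Seq`. A minimal single-image chat sketch:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The {"type": "image"} slot is filled by the image passed to the processor.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image = Image.open("street.jpg")
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```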
## InternViT-6B-448px-V1-5
OpenGVLab · MIT · Image Feature Extraction · Transformers · 155 downloads · 79 likes

InternViT-6B-448px-V1-5 is a vision foundation model fine-tuned from InternViT-6B-448px-V1-2, featuring strong robustness, OCR capability, and high-resolution processing.
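InternViT is an image encoder rather than a generative model: it maps pixels to patch-level and pooled features. A loading sketch following the remote-code pattern typical of OpenGVLab model cards (a GPU is assumed for a 6B encoder):

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model_id = "OpenGVLab/InternViT-6B-448px-V1-5"
# The checkpoint ships its own modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained(model_id)

image = Image.open("document.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)
print(outputs.last_hidden_state.shape)  # patch-token features
```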
## idefics2-8b-base
HuggingFaceM4 · Apache-2.0 · Image-to-Text · Transformers · English · 1,409 downloads · 28 likes

Idefics2 is an open multimodal model developed by Hugging Face that processes image and text inputs to generate text outputs, excelling at OCR, document understanding, and visual reasoning.
## ChatTruth-7B
mingdali · Image-to-Text · Transformers · Multilingual · 73 downloads · 13 likes

ChatTruth-7B is a multilingual vision-language model built on the Qwen-VL architecture, enhanced with large-resolution image processing and a restoration module that reduces computational overhead.
## vit_small_patch14_dinov2.lvd142m
timm · Apache-2.0 · Image Classification · Transformers · 35.85k downloads · 3 likes

A Vision Transformer (ViT) image feature model pre-trained with the self-supervised DINOv2 method on the LVD-142M dataset.
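Despite the classification tag, DINOv2 checkpoints are most often used as frozen feature extractors. A standard timm sketch returning both pooled and patch-level features:

```python
import timm
import torch
from PIL import Image

# num_classes=0 drops the classifier and returns pooled DINOv2 features.
model = timm.create_model("vit_small_patch14_dinov2.lvd142m", pretrained=True, num_classes=0)
model.eval()

config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("sample.jpg").convert("RGB")
batch = transform(image).unsqueeze(0)
with torch.no_grad():
    pooled = model(batch)                   # pooled embedding (384-dim for ViT-S)
    tokens = model.forward_features(batch)  # unpooled patch tokens
print(pooled.shape, tokens.shape)
```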
## vit-base-patch16-224-in21k-eurosat
ingeniou · Apache-2.0 · Image Classification · Transformers · 25 downloads · 0 likes

A model based on Google's Vision Transformer (ViT) architecture, pre-trained on ImageNet-21k and fine-tuned on the EuroSAT dataset for remote-sensing image classification.
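For a fine-tuned ViT classifier, the Transformers pipeline API is the shortest path; the repo id below is inferred from the listing and may differ:

```python
from transformers import pipeline

# Repo id inferred from the listing; adjust if it differs.
classifier = pipeline(
    "image-classification",
    model="ingeniou/vit-base-patch16-224-in21k-eurosat",
)
# EuroSAT labels are land-use categories such as Forest, River, Industrial.
print(classifier("satellite_tile.png", top_k=3))
```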
## segformer-b5-finetuned-cityscapes-1024-1024
nvidia · Other · Image Segmentation · Transformers · 31.18k downloads · 24 likes

A SegFormer semantic segmentation model fine-tuned on the Cityscapes dataset at 1024x1024 resolution, pairing a hierarchical Transformer encoder with a lightweight all-MLP decode head.
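SegFormer is natively supported in Transformers. A minimal sketch producing per-pixel Cityscapes predictions; note that the logits come out at 1/4 of the input resolution and are usually upsampled:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

model_id = "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"
processor = AutoImageProcessor.from_pretrained(model_id)
model = SegformerForSemanticSegmentation.from_pretrained(model_id)

image = Image.open("street_scene.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, 19 Cityscapes classes, H/4, W/4)

# Upsample to the original resolution and take the per-pixel argmax.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
segmentation = upsampled.argmax(dim=1)[0]
print(segmentation.shape)
```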